
feat: Switch DB to postgres #10

Merged 1 commit into main from postgres on Jan 23, 2024

Conversation

bjchambers
Contributor

This uses asyncpg directly.

This removes the tags from the methods, and changes the names to be more descriptive. The intent is to have a single service generated from the OpenAPI specification that includes methods like retrieve_chunks rather than chunks.retrieve(...).

@@ -27,26 +27,26 @@ class RetrieveRequest(BaseModel):
"""Whether to include a generated summary."""


class BaseStatement(BaseModel):
class BaseChunk(BaseModel):
Contributor Author


Attempting to `s/Statement/Chunk/`.



@router.post("/retrieve")
async def retrieve(store: StoreDep, request: RetrieveRequest) -> RetrieveResponse:
"""Retrieve statements based on a given query."""
async def retrieve_chunks(store: StoreDep, request: RetrieveRequest) -> RetrieveResponse:
Contributor Author


Resource Naming: I think it makes sense for this to be `/chunks/retrieve` since it operates on chunks (and conceivably we could have other methods, like listing chunks, etc.).

Method Naming: I can see `retrieve` (if this is the only retrieve method, so we have `service.retrieve(...)`), but I called it `retrieve_chunks` for consistency with other methods that name the resource. Thoughts?

Contributor


`retrieve_chunks` seems fine - it's explicit about what's being returned, which may be good.

It will be easier (in the UI) to treat this as a GET with query params until we have enough query complexity to warrant a full-on POST body. It would also simplify the API a bit, allowing this to be treated as just the list view over chunks. I don't want to overly emphasize the FE's needs, though.

collection_validator = TypeAdapter(Collection)


class CollectionCreate(BaseModel):
Contributor Author


Not sure how we want to standardize naming. Basically, I have:

  • `Collection`: the model that we generally return.
  • `CollectionCreate` (for creation). Could also do `CreateCollection`, `CreateCollectionInput`, `CreateCollectionRequest`, etc.

Contributor Author


It looks like Litestar at least goes with `<Resource>` (for the result), `<Resource>Create` (for the creation request), and `<Resource>Update` (for update requests):

```python
from datetime import date
from uuid import UUID

from pydantic import BaseModel

class Author(BaseModel):
    id: UUID | None
    name: str
    dob: date | None = None

class AuthorCreate(BaseModel):
    name: str
    dob: date | None = None

class AuthorUpdate(BaseModel):
    name: str | None = None
    dob: date | None = None
```

Contributor


I usually ended up with something like `CreateCollectionRequest` and `CreateCollectionResponse` in the gRPC days, but maybe that's more verbose than is necessary in this case.

session.refresh(collection)
return collection
result = await conn.fetchrow("""
INSERT INTO collection (name) VALUES ($1)
Contributor Author


I think asyncpg prepares statements using an LRU cache keyed on the hash of the query text. We could (maybe) manage that manually for more efficiency... but it should be unnecessary -- we'll likely only have one or two queries per request, so the hashing overhead shouldn't be too bad.
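One easy way to lean on that behavior (a sketch; the constant and helper names are assumptions, not the PR's code) is to keep the query text in a module-level constant so every call passes the identical string and asyncpg's per-connection statement cache can hit:

```python
# Module-level query constant: asyncpg keys its per-connection prepared-
# statement LRU cache on the query text, so reusing the identical string
# lets the driver skip a re-prepare.
INSERT_COLLECTION = "INSERT INTO collection (name) VALUES ($1) RETURNING id, name"

async def create_collection(conn, name):
    # `conn` is assumed to be an asyncpg.Connection; same literal every
    # call means the statement is prepared once and reused afterwards.
    return await conn.fetchrow(INSERT_COLLECTION, name)
```

If it ever matters, the cache size is also tunable at connect time via `asyncpg.connect(..., statement_cache_size=...)`.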

"store": Store(),
"engine": engine,
}
# if settings.APPLY_MIGRATIONS:
Contributor Author


I tried to get yolo to run in the application, but it seems to only support older DB versions. I suspect our best path here is to just run migrations ourselves, at least until we do migrations outside of docker compose. We could just have a loop over migration files or something if we really want. For now, I did something super stupid.

See https://gist.github.com/mattbillenstein/270a4d44cbdcb181ac2ed58526ae137d for an example script we could use.
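The loop-over-migration-files idea could look roughly like this sketch (the `schema_migrations` table and directory layout are assumptions; real code would also want locking if multiple app instances start at once):

```python
from pathlib import Path

def pending_migrations(migrations_dir, applied):
    """Return unapplied .sql files in lexicographic (i.e. version-prefix) order."""
    files = sorted(Path(migrations_dir).glob("*.sql"))
    return [f for f in files if f.name not in applied]

async def apply_migrations(conn, migrations_dir):
    # `conn` is assumed to be an asyncpg.Connection; `schema_migrations`
    # is a hypothetical bookkeeping table of already-applied files.
    rows = await conn.fetch("SELECT name FROM schema_migrations")
    applied = {r["name"] for r in rows}
    for f in pending_migrations(migrations_dir, applied):
        # Each migration runs in its own transaction so a failure
        # doesn't leave a half-applied file recorded as done.
        async with conn.transaction():
            await conn.execute(f.read_text())
            await conn.execute(
                "INSERT INTO schema_migrations (name) VALUES ($1)", f.name
            )
```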

Contributor


Yeah I say KISS

- db:/var/lib/postgresql
networks:
- kb-network
healthcheck:
Contributor Author


This causes things that depend on postgres to wait until it reports ready. I'd like to run migrations as part of that, but that is messy.
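For reference, the compose pattern in play is roughly this (service names and timings are assumptions, not necessarily what this file uses):

```yaml
db:
  image: postgres
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U postgres"]
    interval: 5s
    timeout: 3s
    retries: 10

app:
  depends_on:
    db:
      condition: service_healthy  # waits until the healthcheck passes
```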


CREATE TABLE collection (
id SERIAL NOT NULL,
name VARCHAR NOT NULL,
Contributor Author


At some point soon, I'd like to add a `tenant` table, and a `tenant_id` to the collection.
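A hypothetical shape for that, following the conventions of the tables already in this migration (names are assumptions):

```sql
CREATE TABLE tenant (
    id SERIAL NOT NULL,
    name VARCHAR NOT NULL,
    PRIMARY KEY (id)
);

-- Each collection would then belong to a tenant:
ALTER TABLE collection
    ADD COLUMN tenant_id INTEGER NOT NULL,
    ADD FOREIGN KEY (tenant_id) REFERENCES tenant (id);
```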

FOREIGN KEY(document_id) REFERENCES document (id)
);

-- CREATE TABLE ingestion(
Contributor Author


I'd like to split ingestions out. This could make it possible to add a new ingestion (e.g., reconfigure things and re-ingest) to existing documents, as well as support changes to documents.

from app.ingest.extract import extract
from app.ingest.extract.source import ExtractSource
from app.ingest.store import Store, StoreDep

router = APIRouter(tags=["documents"], prefix="/documents")
# TODO: Move this to `/documents`. Will require figuring out
Contributor Author


See this TODO. I'd like to flatten it, but that is tricky for the `list_documents` case (which collection do we list?). Also, need to make sure we can do file upload with a JSON body in `add_document`. So, deferring to a separate PR.



@bjchambers bjchambers merged commit 3277785 into main Jan 23, 2024
@bjchambers bjchambers deleted the postgres branch January 26, 2024 22:34
@bjchambers bjchambers added the enhancement New feature or request label Jan 31, 2024